Compact q-Gram Profiling of Compressed Strings
نویسندگان
چکیده
We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in O(N ↵) expected time and uses O(n+kT,q) space, where N ↵ qn is the exact number of characters decompressed by the algorithm and kT,q N ↵ is the number of distinct q-grams in T . This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use ⌦(n+ kT,q) space. To achieve this we introduce the q-gram graph that space-e ciently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar.
منابع مشابه
Algorithms and data structures for grammar - compressed strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, S is an SLP of size n compressing a string S of size N . We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
متن کاملData Structures for Grammar-compressed Strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, S is an SLP of size n compressing a string S of size N . We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
متن کاملFast q-gram Mining on SLP Compressed Strings
We present simple and efficient algorithms for calculating qgram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T , we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T . Computational experiments show that our algorithm and its variation are pract...
متن کاملQGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings
Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least ...
متن کاملSpeeding Up q-Gram Mining on Grammar-Based Compressed Texts
We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP T of size n that represents string T , the algorithm computes the occurrence frequencies of all q-grams in T , by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T | − dup(q, T...
متن کامل